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BAYESIAN MODEL COMPARISON AND MODEL AVERAGING 
FOR SMALL-AREA ESTIMATION^ 

By Murray Aitkin, Charles C. Liu and Tom Chadwick 

University of Melbourne, University of Melbourne and Newcastle University 

This paper considers small-area estimation with lung cancer mor- 
tality data, and discusses the choice of upper-level model for the vari- 
ation over areas. Inference about the random effects for the areas may 
depend strongly on the choice of this model, but this choice is not 
a straightforward matter. We give a general methodology for both 
evaluating the data evidence for different models and averaging over 
plausible models to give robust area effect distributions. We rean- 
alyze the data of Tsutakawa [Biometrics 41 (1985) 69-79] on lung 
cancer mortality rates in Missouri cities, and show the differences in 
conclusions about the city rates from this methodology. 

1. The lung cancer data. The data are male lung cancer mortality fre- 
quencies and population sizes for the period 1972-1981 in = 84 Missouri 
cities (see Table 1). The variables, given in Tsutakawa and reproduced be- 
low, are the number r of men aged 45-54 dying from lung cancer in each 
city over the period 1972-1981 and the city size n. 

Most of the "cities" are small, though three are large. The mortality rates 
are poorly defined in small cities; four cities with populations less than 200 
have no deaths at all, so the observed rate is zero. Our aim is to estimate 
the mortality rates in each city, using the information from other cities in 
the most appropriate way. 

2. Small-area estimation. Variance component models are widely used 
in small-area estimation; the term borrowing strength is commonly used to 
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Table 1 

Male lung cancer mortality frequency and city size 1972-1981 in 84 Missouri cities 
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describe the improvement in precision of estimation for the parameters of 
smah areas by relating them through a two-level model. Empirical Bayes 
methods [Carlin and Louis (1996)] are widely used to represent the preci- 
sion of the small-area parameters through their estimated posterior distribu- 
tions, substituting the unknown variance component and other parameters 
by their maximum likelihood estimates. It has long been known [see, e.g., 
Tsutakawa (1985)] that this is not a satisfactory representation for the in- 
formation about the small areas because the imprecision in the estimation 
of the variance component and other parameters is not allowed for. Fully 
Bayesian methods allow for this correctly, and will be discussed here. 

The upper-level model for the among-area variation is routinely assumed 
to be normal on a suitable scale, though other choices, like the conjugate 
gamma distribution for Poisson rates, or the beta distribution for binomial 
proportions, are possible. With more than one possible model for rates, we 
need to compare the relative evidence for the competing models; if one is 
greatly superior to the others, we can discount the poorer models. A fur- 
ther issue is how to combine, or average over, the models if several are well 
supported by the data. 
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The comparison of upper-level models is frequently done using the De- 
viance Information Criterion DIG [Spiegelhalter et al. (2002)]. In this ap- 
proach the posterior distribution of the deviance is computed for each model, 
and the mean (or sometimes the median) of this distribution is penalized by 
a function of the number of model parameters; an important issue in this 
approach is the need to specify the focus of the likelihood, in the sense of 
identifying the level in the model at which inference is to be performed. 

This requirement was criticized by the paper discussants, and in more 
detail by Celeux et al. (2006); see also Trevisani and Gelfand (2003). In 
their response, Spiegelhalter et al. [(2002), page 634] said: 

"Smith and others ask how the model focus should be chosen in practice. 
We argue that the focus is operationalized by the prediction problem of in- 
terest. For example, if the random effects in a hierarchical model relate to 
observation units such as schools or hospitals or geographical areas, where 
we might reasonably want to make future predictions for those same units, 
then taking p{y\0) as the focus is sensible. The prediction problem is then to 
predict a new YJ^rep conditional on the posterior estimate of 6i for that unit. 
However, if the random effects relate to individual people, say, then we are 
often interested in population-average inference rather than subject-specific 
inference, so we may want to predict responses for a new or 'typical' individ- 
ual rather than an individual who is already in the data set. In this case, it is 
appropriate to integrate over the 9s and to predict Yrcp for a new individual 
conditional on [the random effect distribution parameters] tp, leading to a 
model focused on p{y\7p). A crucial insight is that a predictive probability 
statement such as piYrcply) is not uniquely defined without specifying the 
level of the hierarchy that is kept fixed in the prediction — this defines the 
focus of the model." 

In the context of the lung cancer data, these two cases would correspond 
to inferential statements about a given city rate, and inference about an 
average or typical city rate. 

In this paper we give a general methodology which treats these two cases 
in the same way, using the full posterior distribution of the Fisherian ob- 
served data likelihood (rather than a penalized version of its mean or me- 
dian) and the standard marginal and conditional arguments for the indi- 
vidual city random effect distributions and the marginal rate distribution 
across cities. 

We use this methodology to illustrate: 

• the differences in conclusions which may arise from different upper-level 
models; 

• the comparison of upper-level models; 

• and the form of model averaging which protects against such differences, 

for both individual city and "typical city" rates, with a re-analysis of the 
Missouri lung cancer data analyzed by Tsutakawa (1985). 
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Fig. 1. Posterior densities for five cities from Tsutakawa. 

3. The Tsutakawa analysis. Tsutakawa gave empirical Bayes and ap- 
proximate fully Bayes analyses of these proportions, and showed a slight 
increase in dispersion of the fully Bayes posterior distributions for small 
cities. He used a nonconjugate normal model for the logit of the cancer 
rates which required several approximations for the computations. 

His Figure 1, reproduced above, shows the empirical Bayes and full Bayes 
posteriors for the logits of the death rates for five of the cities, numbered 1, 
8, 17, 83 and 84 in the data table. 

The full Bayes posteriors are slightly more diffuse than the empirical 
Bayes posteriors, particularly for city 1 with its small population. 

We note for subsequent comparison that Bayes inferences about the city 
rates may be based directly on the data from each city, without (apparently) 
any modeling: with a flat prior on the rate pi in the ith city, and the observed 
death count rj in the population of rii, the ith city death rate has a posterior 
Beta{ri + 1, nj — rj + 1) distribution. (We do not use the Jeffreys or Haldane 
priors, as these give improper posteriors for the cities with zero cases.) 

These distributions for the same five cities, shown on the logit scale in 
Figure 2, are quite different from the empirical or the full Bayes posteriors 
in Figure 1 for the small cities, though very similar for the larger cities. This 
is an example of "posterior shrinkage" from the normal model assumption 
for the distribution of the city random effects on the logit scale. 

Other possible models need to be examined for the death rates and com- 
pared to the normal model, since posterior shrinkage may be quite different 
for different models [Aitkin (1999)]. We set out a general methodology for 
this purpose, following Dempster (1974, 1997), Aitkin (1997) and Aitkin, 
Boys and Chadwick (2005); see also Fox (2005). 
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Fig. 2. Posterior densities from each city data only. 

4. Model comparison via posterior deviances. We have competing mod- 
els Mj (j = 1,..., J) for observed data y, with densities fj{y\6j) under 
model Mj, and prior distributions 7rj{9j). Given the data y, we form the 
likelihoods Lj{6j\y), and update the priors to the posteriors T^j{0j\y)- We 

make T independent draws (both within and among models) 0^*' from the 

posteriors, and substitute these into the likelihoods, to give Lj{9^j^). These 
are T independent draws from the posterior distributions of the likelihoods 
Lj(9j\y). To compare models j and k, we compute the likelihood ratio values 

L^*! = Lj{9^j^)/Lk{9^^'^), which are T independent draws from the marginal 
posterior distribution of Lj^. 

A likelihood ratio Ljk of 9, with equal prior probabilities of 0.5 on each 
model, would give a posterior probability of 0.9 for model j, which would 
generally be regarded as quite strong evidence for model j over model k. 
Over the T draws L^*|, we compute the (empirical) probability that Ljk > 9; 
if this is 0.9 or more, we have a high posterior probability of quite strong 
evidence in favor of model j over model k. 
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It is easier to compute and interpret deviances (unfortunately easily con- 
fused with frequentist deviances). The deviance for model Mj is Dj = 
-2logLjiej). 

For the tth draw 6^j^ from the posterior of 6j, we obtain the tth draw of 
the deviance for model Mj: 

Df = -2logL,{9f). 

A likelihood ratio Ljk of 9 is equivalent to a deviance difference Djk = 
Dj - Dk of -2 log 9 = -4.4. If the empirical Pr[L>jfc < -4.4|data] > 0.9, we 
have a high posterior probability of quite strong evidence for Mj against 
Mfc. We extend this approach to model averaging in Section 5. 

A formal analysis follows Spiegelhalter et al. (2002). For large samples 
from regular models with flat priors on the parameters 9, a Taylor expansion 
of D{6) about 9 shows that the deviance can be expressed [equation (18) of 
Spiegelhalter et al. (2002)] as 

D{9)o^Die) + xl 

where s is the dimensionality of 9, and D{9) is the frequentist deviance. (For 
nonnested model comparisons, all integrating constants must be included in 
the model likelihood.) If the competing deviance distributions are plotted 
as cdfs on the deviance scale, those with more parameters will have lower 
slopes because of the increasing variance of the Xs distribution with s. 
For comparison of nonnested models 1 and 2, the deviance difference 

Di2 = Di{9i) - D2{92) ^ Di{9i) - 1)2(^2) + xl - xl,, 

where the x? variables are independent. Thus, the distribution of the de- 
viance difference D12 will be a difference of independent variables location- 
shifted by the frequentist deviance difference FD12. The distribution of a 
difference between independent variables does not have a closed-form 
density, but is very easily simulated. 
If both si and S2 are large, 

D12 ~ si - S2 + \/2iTAf(0, 1) - v^A^(0, 1) + FD12, 

that is, 

D12 ^Si-S2 + iV(0, 2si + 2S2) + ^12- 

Here the frequentist deviance difference is penalized by the difference in 
degrees of freedom, but the uncertainty in the comparison — which increases 
with increasing numbers of parameters — provides a probabilistic measure of 
certainty about the superiority of one model over the other. This is different 
from the usual penalized LRT approaches which provide only single-number 
decision criteria without a probabilistic interpretation. 
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The likelihoods being compared are the observed-data likelihoods, obtained 
by integrating out the random effects. Their computation will in general 
require numerical integration or MCMC to obtain the parameter posteriors. 
For the lung cancer data there are no covariates, so a direct integration 
approach is straightforward. Since there are no individual-level variables, 
each city population is effectively homogeneous, and the two-level variance 
component model reduces to an overdispersion model in the proportions pi. 
We first adapt the original analysis by Tsutakawa. 

4.1. Normal logit analysis. We depart slightly from Tsutakawa's analysis 
by using the binomial distribution for the number of deaths in each city 
rather than the Poisson distribution; since the observed rates are very small, 
and the city populations 100 or more, this makes a negligible difference, as 
we use the same upper-level normal logit model as Tsutakawa. 

The number of deaths is modeled by a binomial random variable Ri 
with true death proportion pi : 

Ri I Pi,ni ~ b{ni,pi). 

At the upper level, the logits of the city proportions are modeled by a non- 
conjugate normal distribution: 



log 



Pi 



\-Pi. 

To compute the likelihood Li{fi,a), we integrate over the unobserved 9i us- 
ing a (K =)20-point Gaussian quadrature, with masses vr^. at mass-points z^, 
k = l,...,K: 

^ 'n,\ e^^^^ 1 r l{9i-iJif 



=1 



^ fnA e(^+^^')''' 1 r 1 



=1 



N , . K 



n 



We compute the likelihood numerically over an equally spaced grid of 
G = 100 X 100 values (/U[g],cj[g]) in the region of appreciable likelihood: /i G 
(—4.56, —4.20), o" G (0.08,0.40), sum the likelihoods over the grid and nor- 
malize to give the posterior mass function 7r(/i,cj|y) of (/U,cj) for flat priors 
on the grid, as shown in Figure 3. 

The maximum likelihood estimates of /z and o", and the maximized log- 
likelihood over the grid are /i = — 4.787, <t = 0.2368 and —181.254; the fre- 
quentist deviance is 362.51. These discrete MLEs are close, but not identical. 
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Fig. 3. Joint posterior mass function of [^,0). 



to those reported by Tsutakawa: fi = —4.733, a = 0.2384. The figure strongly 
suggests multi-modahty in the distribution, but this causes no difficulty in 
the subsequent analysis. Careful inspection shows that Tsutakawa's esti- 
mates are near a saddle point, and that the likelihood has four local maxima, 
with values of (^,cr) and log-likelihoods given by (-4.787,0.2368,-181.254), 
(-4.708,0.2016,-181.397), (-4.776,0.3072,-181.425), (-4.668,0.2432, 
—181.612). The log-likelihood at Tsutakawa's estimates is —182.059. These 
small differences have very little effect on the area posterior densities. 

We note for later discussion that, by weighting —2 times the log- likelihoods 
^i(//[g], (T[g]) at each grid point by 7r(//[g], crj^j |y), their posterior probabilities, 
and summing, we obtain the posterior mean of the deviance used in the DIG 
[Spiegelhalter et al. (2002)]. For the normal model the posterior mean de- 
viance is 364.32. This may be combined with the maximized log-likelihood 
to give one version of the effective number of parameters po defined by 
equation (10) of Spiegelhalter et al. (2002): 



PD = Die)-D{9y, 




we use for the second term the deviance at the MLEs rather than at the 
posterior mean, since the latter is not a grid point in general and would 
require additional computation. Here the second term is 362.51, giving pu = 
1.81, and D/Ci = 366.13. 

However, Figure 3 provides much more information, since it gives the exact 
(up to the grid resolution) posterior distribution of the deviance, without 
any Monte Carlo simulation. Sorting the deviances 

into increasing order with their corresponding posterior probabilities and 
cumulating the latter, we obtain the cdf of the deviance, as shown in Figure 
4. 

Conditional on the ith. area data rj,nj and the parameters /u,cr, the pos- 
terior density of pi has the form 

7r(e, I a, r,) = c(/i, a) ( ) (1 - p.r-'-' expj - i ^^i^ } 
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Expanding the log density in Qi about Bi = log [rj/(nj — rj)] up to second- 
order terms, it is easily shown that, to this order of approximation, 



9i \ ri,ix,a N 



where 'i/'i and ip are the sample and prior precisions of 9i: 

iii=nipi{l-pi), V=l/c^^- 

This approximation fails if = 0; in this case we set rj = 0.5. The level of 
approximation is not affected by the distribution of 

The posterior densities for the area random effects pi are most easily 
computed by Gaussian kernel smoothing of T = 10,000 random values , 
generated from the normal distributions obtained by substituting T random 
draws (/it*l,(TW) into the area posterior distributions: 

where ^t*' = l/a^^ and z'*] is a random draw from iV(0, 1). 

Figure 5 shows the results for the five cities above for T = 10,000. 

Comparison with Tsutakawa's Figure 1 shows close agreement for all five 
cities, though the vertical scales are different. 

We now re-analyze the data with a beta distribution for the area propor- 
tions, which gives simply computed forms for the likelihood and the posterior 
distributions. 

4.2. Beta-binomial analysis. At the upper level, the city proportions are 
modeled by a conjugate beta distribution: 

Pj I a, 6 ~ Beta{a, b), 

with density function 

f{p)=p''-\l-p)'-'/B{a,b), a,b>0, 
where B{a,b) is the complete beta function 

B{a,b) = r{a)r{b)/r{a + b). 
The beta-binomial likelihood is denoted by 



^2(a,6) = n 



i=l 



B{ri + a,ni-ri + b)/B{a,b) 



BAYESIAN MODEL COMPARISON 



11 




-Si -5.0 -4.5 -tJ} -3* 



Fig. 5. Normal model posterior densities for five cities. 

The likelihood in this parametrization is very highly correlated in {a,b); we 
use instead the (mean, standard deviation) parametrization, with 

fif3 = a/{a + b), ap = ^Jah/[{a + bf{a + 6 + 1)]. 

The likelihood, shown in Figure 6, is nearly orthogonal in these parameters. 
We use this form of the likelihood for subsequent computation. 

We compute the likelihood numerically over a grid of G = 100 x 100 values 
{lJ'l3[g]-,^(3\g\) ^ the region of appreciable likelihood: G (0.007, 0.011), S 
(0.001,0.004), sum the likelihoods over the grid and normalize to give the 
posterior mass function Ti{^p.,ap\y) of (fip^ap) on the grid. The maximized 
log-likelihood is —181.486, with frequentist deviance 362.97. 

As for the normal model, by weighting the log-likelihoods at each grid 
point by —2 times their posterior probabilities and summing, we obtain the 
posterior mean of the deviance for the beta model, which is 364.99, slightly 
larger than that for the normal model, of 364.32. The effective number of 
parameters is po = 2.02, and DIC2 = 367.01. 
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Fig. 6. Joint posterior mass function of beta mean and SD. 

The full posterior distribution of the beta deviance D2{^p., (Jp) = — 21ogL2(/i^, 
CT/j), computed as for the normal model, is shown (dotted curve) in Figure 
7, together with that of the normal model 1 (solid curve). We discuss the 
comparison of these models in the next section. 

Conditional on the ith area data rj,nj and the parameters a, 6, the pos- 
terior distribution of pi is again beta: 

7r(pi I a An) =pT-^''~\l-pir-''+''~^/B{ri + a,n, -n + b). 

The posterior densities for the area random effects pi are again computed 
by Gaussian kernel smoothing of T = 10,000 random values pf^, generated 
from T random draws (/ij^^ , cr^^ ) from their posterior distribution which are 
converted to T random values (a'*],^'*!) of {a,h). We finally draw the T 
random values of pi, one each from the T beta distributions with indices 
+ aP'\ni — ri + 6^ for the given i. 

The T values of pi are transformed to the logit scale for ease of inspection 
and consistency with Tsutakawa's analysis; posterior densities for individual 




BAYESIAN MODEL COMPARISON 13 

I I I I I- 

W - 
OS 
OM - 
0.7 
OS 

DA - 
03 - 
02 - 

(■jO 

360 3ES 370 a7S 3» 

Fig. 7. Beta (dotted) and normal (solid) deviances. 

cities are then computed using Gaussian kernel densities with bandwidths 
chosen to give smooth densities. Figure 8 shows the five beta posteriors 
(dotted curves) together with the normal posteriors from Figure 6 (solid 
curves). The city numbers are placed at the intersection of the two densities. 

The beta posteriors are slightly less concentrated than the normal poste- 
riors except for city 84, and show slightly more shrinkage toward the mean. 

Since the posterior conclusions from the beta distribution differ somewhat 
from those from the normal, we need to decide whether the data support 
one model over the other. 

5. Model comparisons via deviances. For the comparison of these mod- 
els, and of other models for the true proportions, we compare the model 
deviances D = — 21og-L. 

The two models for the pi considered above are not the only possible mod- 
els. Following Spiegelhalter et al. (2002), we consider four possible models 
for the Pi: 
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Fig. 8. Beta (dotted) and normal (solid) posterior densities for five cities. 

• Model 1 — the normal logit model: logitpj ~ A^(^, cr^); 

• Model 2 — the beta model: pi ~ B(^a,b); 

• Model 3 — the null model: pi = p; 

• Model 4 — the saturated model: pi all different and unrelated. 

Models 3 and 4 are different from Models 1 and 2 in that there is no actual 
parametric model for the variation of the pi across the cities — each city has 
its own single parameter under Model 4, and all cities have the same single 
parameter under Model 3. 

We compare directly the deviance distributions under each model. We 
first apply this approach to the normal and beta deviance distributions; we 
show all distributions in Figure 10. The beta deviance has a lower slope and 
is consistently to the right of the normal deviance, so the normal model is 
preferred, but not very strongly. To compare the models directly we draw 
10,000 random values Df\D2^ from each deviance distribution and com- 
pute the deviance differences D^l = — D^^ (beta — normal) from the 
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10,000 (unordered) values for each model. The cdf of the deviance difference 
distribution is shown in Figure 9. 

The distribution has its median at 0.505, and the 95% credible interval 
for the true difference, computed from the 250th and 9750th ordered differ- 
ences [Congdon (2005)] is (—5.125,6.378). The estimated probability that 
the normal deviance is smaller than the beta is 0.6332: we cannot confidently 
choose between these models. 

The deviance constructions for Models 3 and 4 are different from those 
for Models 1 and 2. The null Model 3 likelihood given p is 

m 
i=l 



p«(i-p) 



where R = Y.iri,N = Y,^ni. 
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Fig. 10. Deviances for null (dashed), normal (solid), beta (dotted) and saturated (dot- 
dashed) models. 



For this one-parameter model, which is effectively a single-sample model, 
the prior represents uncertainty about the value of p, not the variability of 
different pi across cities. We use a uniform prior for giving a Beta{R + 
l^N — R + 1) posterior distribution of p. We draw 10,000 random values 
from this posterior, and for each compute the likelihood L3(pW) and the 
corresponding deviance D^^ . 

For the saturated Model 4, the prior specification is similar. Each city 
has its own pi, unrelated to the others, so the uniform prior is used for 
each city independently, giving the set of Beta{ri + l,ni — ri + 1) posterior 

distributions oi pi. We draw 10,000 random values from each posterior, 

and for each m and each i compute the likelihood contribution L4j(p|*') and 

the corresponding deviance D^^ = — 2X]ilog-Z>4j(pf' )• 

The cdfs of all four deviance distributions are shown in Figure 10. 
Model 3 is nearly 40 deviance units to the right of the normal and beta — 

it is immediately clear that the null model is untenable. We exclude it from 
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Fig. 11. Deviance difference normal — saturated model. 

further consideration. The deviance distribution for the saturated model 
crosses those for the normal and beta models, indicating no strong preference 
for the saturated model over the parametric models. For the better-fitting 
normal model, the distribution of the deviance difference (normal deviance — 
saturated deviance) is shown in Figure 11. 

Of the 10,000 differences, 2216 are positive, so the estimated probability 
that the saturated deviance is smaller is 0.7784. This is not compelling 
evidence against the normal model; since the saturated model allows no 
"borrowing of strength" across cities, we retain both the normal and beta 
models as candidates for the upper-level distribution, in addition to the 
saturated model. 

6. Posterior model averaging. Model averaging is frequently proposed 
[see, e.g., Hoeting et al. (1999)] for posterior inference about a common 
parameter (like the mean) across several competing models. The posterior 
densities under each model are averaged with respect to the posterior prob- 
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abilities of each model, based on their marginal likelihoods integrated over 
the prior distributions of the parameters. 

This process uses the integrated likelihoods; as with model choice, we 
use the posterior distributions of the likelihoods to provide an averaged 
posterior density, but with respect to posterior probabilities of each model 
which are themselves random variables rather than fixed constants. The 
model comparison approach of Section 3 is simply extended to deal with 
this. 

As before, 9j are the model parameters for model Mj, 6^j^ are the T 

draws from the posterior distribution of 6j given the data y, are the 

corresponding T draws from the posterior distribution of Lj, and pfj are 
the corresponding T values of the random effect pi for area i under Mj. 

Let TTj be the prior probability of Model j; the tth. draw 7r[*](Mj|y) from 
the posterior probability of Model j is then 

niHM,\y) = .,Lf/Y.n,Lf. 

k 

The averaged density Pave,i for pi is then defined through the T simulated 
values j produced by the algorithm: for each i and each m: 

1. compute -7rW(Mj|y); 

2. with probability 7rW(Afj|y), set p^jf, j 

This looks complicated notationally but is quite simple: the likelihoods 
for each draw define the posterior probabilities for each model for this draw, 
and we simply choose the model j random effect draw with the model j 
posterior probability draw [Congdon (2005, 2006)]. 

In the computation of the averaged density we exclude the null model, 
and include the other three. Prior probabilities for the three distributions 
are taken as equal, though as described above it is a simple matter to change 
them. We show below several graphs of the city posterior rate densities on 
the logit scale for the five cities shown earlier. Figure 12 shows the averaged 
densities, Figure 13 shows the averaged (solid) and local area (dot-dashed) 
densities from Figure 2, and Figure 14 shows the averaged (solid) and normal 
(dot-dashed) densities. 

The sharpness of the normal area posteriors is damped by the averaging 
process, because in the 10,000 draws from each posterior deviance distribu- 
tion, the saturated model likelihood is generally superior to the normal or 
beta likelihood. For the large city 84 there is little difference in the posteriors 
because of the very large city size; this city "lends strength" to the small 
cities in the normal model which are substantially shrunk toward the poste- 
rior mean. The generally superior likelihood for the saturated model shows 
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Fig. 12. Averaged posteriors, five cities. 

that neither the normal nor the beta provides a convincing representation of 
the random effect distribution, and so conclusions about individual random 
effects need to be based substantially on the local area rate. 

7. Model averaging for a "typical" city. The analysis above assumes 
that our inferential interest is in the individual city rates. This is the most 
common use of the two-level model. If, however, (or in addition) we wish 
to draw conclusions about a "typical" city, this is done using the among- 
city model — either normal or beta. The saturated model does not provide 
information about a "typical" city, because the city rates are unrelated under 
this model. 

The definition of a "typical" city is unclear; we take for illustration a 
random city, in the sense of a random draw from the among-city model. 
Under the normal model, the random logit rate will be ^ + aZ, where Z ~ 
A^(0, 1), while under the beta model for p, it will be a random draw from 
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Fig. 13. Averaged (solid) and local (dot-dashed) posteriors, five cities. 



the transformed beta density 



fiO) 



(l + e« 



1+6 ■ 



So the posterior distribution of the random city rate under the normal model 
is given by the empirical distribution of the T random draws /i'*] +(tWzW, 
while that for the beta model is that of the T random draws 0^*' from 
5eta(oW,6M). To model average these, we draw the normal 
with posterior probability 7rW(A^|y), and the beta 9^^'^ with posterior prob- 
ability 1 — 7rM(A^|y), where only these two models have positive posterior 
probability. 



8. Discussion. These conclusions may seem surprising. The idea of "bor- 
rowing strength" is well established in the Bayesian and non-Bayesian lit- 
erature for random effect models. The difficulty with the lung cancer data 
is that the "strong" cities from which strength may be borrowed have es- 
sentially only two support points, since cities 4 and 44 have observed rates 
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(0.00742 and 0.00867) which are very similar and near the median observed 
rate (though nearly the smallest of the normal posterior mean rates) , while 
city 84 has almost the highest observed rate (0.01484), and has the highest 
normal posterior mean rate. 

The small cities with limited data cannot shrink effectively toward either 
support point, since their rates are not well identified with these points, 
and the generally higher likelihood for the saturated model accentuates this 
effect. The model averaged rate distributions remain diffuse, though less so 
than the single city rates. 

What has been missing in the use of these models is a direct and straight- 
forward assessment of the appropriateness, or goodness of fit, of the upper- 
level model. This has been done in Bayesian analysis mostly through Bayes 
factors, with their attendant prior sensitivity difficulties, or through the DIG, 
with its definitional "focus" difficulties and ambiguous penalty. 

The comparison of models through their posterior likelihood or deviance 
distributions provides a straightforward way, not only to compare different 
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parametric models, but also to evaluate the quality of reproduction of the 
local area rates by the parametric models. 

It is not, however, necessary to "pick the best model": the posterior like- 
lihood model comparison extends directly to Bayesian model averaging, and 
provides a compromise among the well-supported competing models. In this 
approach we treat inference about the "typical" city rate and about the 
individual city rates in the same way, though the relevant distributions are 
marginal instead of conditional. 

9. Conclusion. The comparison of deviance distributions allows the ex- 
pression of model comparisons through likelihood ratios, as for simple null 
and alternative hypotheses, while allowing automatically for the complexity 
of the model: unnecessarily complex models have more diffuse likelihoods 
and so the approach of explicit penalization of the maximized likelihood 
through some function of the number of model parameters is not neces- 
sary. Incorporation of informative parameter or model priors is simple and 
straightforward. A further example of deviance distribution comparisons on 
the galaxy data of Roeder (1990) is given in the discussion of Ridall et al. 
[(2007), pages 264-265] by Aitkin. 

The choice of a suitable upper-level model for this example is informed by 
Tsutakawa's original analysis of the data; for this simple data set the beta 
distribution provides an alternative model which performs nearly as well as 
the normal model, but both these models are (nonsignificantly) inferior to 
the saturated model, despite the diffuseness of its deviance distribution from 
the 84 cities. 

This analysis is simply adapted to provide model-averaged posterior shrink- 
age when several upper-level models are well supported by the data; these 
model-averaged posteriors provide a compromise reflecting the uncertainty 
about the appropriate model. 
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